Machine learning of plasma metabolome identifies biomarker panels for metabolic syndrome: findings from the China Suboptimal Health Cohort

Background Metabolic syndrome (MetS) has been proposed as a clinically identifiable high-risk state for the prediction and prevention of cardiovascular diseases and type 2 diabetes mellitus. As a promising “omics” technology, metabolomics provides an innovative strategy to gain a deeper understanding of the pathophysiology of MetS. The study aimed to systematically investigate the metabolic alterations in MetS and identify biomarker panels for the identification of MetS using machine learning methods. Methods Nuclear magnetic resonance-based untargeted metabolomics analysis was performed on 1011 plasma samples (205 MetS patients and 806 healthy controls). Univariate and multivariate analyses were applied to identify metabolic biomarkers for MetS. Metabolic pathway enrichment analysis was performed to reveal the disturbed metabolic pathways related to MetS. Four machine learning algorithms, including support vector machine (SVM), random forest (RF), k-nearest neighbor (KNN), and logistic regression, were used to build diagnostic models for MetS. Results Thirteen significantly differential metabolites were identified, and pathway enrichment revealed that arginine and proline metabolism and glutathione metabolism are disturbed metabolic pathways related to MetS. The protein-metabolite-disease interaction network identified 38 proteins and 23 diseases associated with 10 MetS-related metabolites. The areas under the receiver operating characteristic curve of the SVM, RF, KNN, and logistic regression models based on metabolic biomarkers were 0.887, 0.993, 0.914, and 0.755, respectively. Conclusions The plasma metabolome provides a promising resource of biomarkers for the predictive diagnosis and targeted prevention of MetS. Alterations in amino acid metabolism play significant roles in the pathophysiology of MetS. The biomarker panels and metabolic pathways could be used as preventive targets in dealing with cardiometabolic diseases related to MetS.
Supplementary Information The online version contains supplementary material available at 10.1186/s12933-022-01716-0.

diseases (CVD) and type 2 diabetes mellitus (T2DM) in the future [2]. According to the International Diabetes Federation (IDF) definition of MetS, the prevalence of MetS is approximately 25% among adults worldwide [3]. MetS and its consequent chronic diseases lead to high morbidity and mortality rates. In 2016, CVD resulted in 17.9 million deaths [4], and 6.7 million individuals died from T2DM in 2021 worldwide [5]. As these cardiometabolic diseases are among the leading causes of death worldwide, MetS remains a global health issue.
MetS has a multifaceted etiology involving complex interactions between genetic and environmental factors [6]. Its pathophysiology is characterized by abnormal metabolism, including dysregulation of glucose and lipid metabolism [7], storage of adipose tissue [8], and chronic low-grade inflammation [9]. Although increasing evidence has shown that insulin resistance and obesity play essential roles in the pathophysiology of MetS [10,11], several other factors, such as increased cellular oxidative stress [12], low mitochondrial function [13], and dysregulation of the hypothalamic-pituitary-adrenal axis [14], can also be involved in its pathogenesis. Given this multifactorial pathophysiology, MetS needs to be understood and studied from a systemic point of view.
To comprehensively investigate the metabolic characterization of MetS and its role in the development of consequent cardiometabolic diseases, several attempts have been made to screen biomarkers using various omics technologies, including metabolomics [15]. Metabolomics, an emerging "omics" technology, is the profiling of metabolites in a biological system [16]. With the help of metabolomics, the pathophysiological characteristics of MetS have been further explored by looking for potential metabolic biomarkers that support the diagnosis and treatment of MetS. These new metabolic insights could lead to a paradigm shift in how preventive interventions and treatment targets are discovered [17]. In recent years, studies in American, Japanese, and Dutch cohorts have identified several MetS-related metabolic pathways, including amino acid metabolism, glutathione production, gluconeogenesis, and the tricarboxylic acid cycle [18][19][20]. However, to the best of our knowledge, the plasma metabolome of MetS patients has not been systematically profiled in a large Chinese cohort to identify biomarkers for the diagnosis of MetS.
The analysis of metabolomics big data is complicated due to its complex structure, such as high dimensionality, high noise levels, and missing values. Conventional statistics-based models are usually not suitable for the analysis of metabolomics big data. Therefore, machine learning methods have become popular for the analysis of metabolomics data, especially for the construction of prediction models based on potential biomarkers for the diagnosis of diseases [21]. Notably, the selection and optimization of machine learning algorithms are also crucial in the diagnosis of diseases.
Accordingly, the aim of the present study was to comprehensively investigate the plasma metabolic characteristics of MetS in a large, well-established Chinese cohort, the China Suboptimal Health Cohort Study (COACS), and to screen potential metabolic biomarkers for MetS using proton nuclear magnetic resonance (¹H-NMR)-based untargeted metabolome profiling. Univariate analysis and multivariate analysis were applied to identify potential metabolic biomarkers for the diagnosis of MetS. Metabolic pathway enrichment analysis was performed to discover which metabolic pathways and metabolites are crucial to the pathophysiology of MetS. Four machine learning algorithms, including support vector machine (SVM), random forest (RF), k-nearest neighbor (KNN), and logistic regression, were used to build diagnostic models for MetS based on potential metabolic biomarkers. The protein-metabolite-disease interaction network was also explored, so that novel insights or hypotheses regarding the progression of MetS towards its consequent cardiometabolic diseases might be obtained.

Study design and participants
A community-based study was conducted in a Chinese population who received routine health check-ups at the Jidong Oilfield Staff Hospital from September 2013 to June 2014. The present study was based on a well-designed cohort named the COACS cohort, which was described previously [22]. All participants were required to meet the following inclusion criteria: (1) aged 18 to 65 years; and (2) signed informed consent before participation. Participants were excluded if they were currently suffering from one or more of the following diseases: (1) diabetes; (2) hypertension; (3) hyperlipemia; (4) cardiovascular or cerebrovascular conditions; (5) cancers; or (6) gout. All participants included in this study signed written informed consent forms. The study was approved by the Ethics Committee of the Jidong Oilfield Staff Hospital. Ethics approval was given in compliance with the Declaration of Helsinki.

Measurements and sample collection
The demographic characteristics of participants, anthropometric measurements, and biochemical tests were collected as described in our previous study [22]. According to the IDF definition of MetS [23], participants were defined as having MetS if they had abdominal obesity and any two of the following four phenotypes: (1) systolic blood pressure (SBP) ≥ 130 mmHg and/or diastolic blood pressure (DBP) ≥ 85 mmHg; (2) triglycerides (TG) ≥ 1.7 mmol/L; (3) fasting plasma glucose (FPG) ≥ 5.6 mmol/L; or (4) high-density lipoprotein cholesterol (HDL-C) < 1.03 mmol/L in men or < 1.29 mmol/L in women. Abdominal obesity was defined as waist circumference (WC) ≥ 90 cm in men and WC ≥ 80 cm in women [23]. After at least 12 h of fasting, blood samples were collected from all participants by venipuncture in the morning. The plasma samples were separated in the laboratory by centrifugation at 3000 × g for 10 min at 4 °C. The samples were then stored immediately at −80 °C, and freeze-thaw cycles were strictly avoided until metabolomic analysis [22].
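The case definition above can be expressed as a small decision rule. The following is a minimal sketch that encodes only the thresholds stated in this section; the `Participant` record, its field names, and the function names are hypothetical and are not taken from the study's analysis code.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical record holding the measurements used in the definition."""
    sex: str           # "M" for men, "F" for women
    wc_cm: float       # waist circumference, cm
    sbp_mmhg: float    # systolic blood pressure, mmHg
    dbp_mmhg: float    # diastolic blood pressure, mmHg
    tg_mmol_l: float   # triglycerides, mmol/L
    fpg_mmol_l: float  # fasting plasma glucose, mmol/L
    hdl_mmol_l: float  # HDL cholesterol, mmol/L


def _check_sex(sex: str) -> None:
    if sex not in ("M", "F"):
        raise ValueError("sex must be 'M' or 'F'")


def has_abdominal_obesity(p: Participant) -> bool:
    """WC >= 90 cm in men, WC >= 80 cm in women."""
    _check_sex(p.sex)
    return p.wc_cm >= (90.0 if p.sex == "M" else 80.0)


def count_phenotypes(p: Participant) -> int:
    """Count how many of the four additional phenotypes are present."""
    _check_sex(p.sex)
    hdl_cutoff = 1.03 if p.sex == "M" else 1.29
    flags = [
        p.sbp_mmhg >= 130 or p.dbp_mmhg >= 85,  # elevated blood pressure
        p.tg_mmol_l >= 1.7,                     # elevated triglycerides
        p.fpg_mmol_l >= 5.6,                    # elevated fasting glucose
        p.hdl_mmol_l < hdl_cutoff,              # reduced HDL-C
    ]
    return sum(flags)


def has_mets(p: Participant) -> bool:
    """Abdominal obesity plus any two of the four phenotypes."""
    return has_abdominal_obesity(p) and count_phenotypes(p) >= 2
```

For example, a man with WC 95 cm, blood pressure 135/80 mmHg, and TG 1.8 mmol/L meets the definition, whereas a woman with WC 78 cm does not, regardless of her other measurements.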

Untargeted ¹H-NMR metabolomics analysis
Plasma samples were thawed at 4 °C. Once thawed, 200 μL of plasma was added to 400 μL of 0.045 M phosphate-buffered saline (PBS) prepared in deuterium oxide (D₂O) and vortexed for 10 s. The mixture was centrifuged at 13,000 rpm for 15 min at 4 °C. Then 550 μL of supernatant was transferred into 5 mm NMR tubes for further analyses.
All ¹H-NMR spectra of plasma samples were acquired using a Varian VNMRS 600 MHz spectrometer (Agilent Technologies, USA) operating at a ¹H frequency of 599.77 MHz. One-dimensional (1D) ¹H-NMR spectra were recorded using the Carr-Purcell-Meiboom-Gill (CPMG) pulse sequence. Each spectrum was acquired with 128 scans per sample using a spectral window of 16.4 ppm. The temperature was kept constant at 25 °C. Water suppression was achieved by using gated irradiation focused on the water frequency. All raw spectra files were obtained using VnmrJ software (Agilent Technologies, USA).

Data analysis and statistics
The study design and data analysis workflow are shown in Fig. 1. The raw NMR data were recorded as free induction decay (FID) files, which are time-domain spectra. The FID files were then Fourier transformed into frequency-domain spectra using NMRProcFlow software [24]. To remove the effects of possible variations in water suppression efficiency, the region of the water signal was discarded. NMRProcFlow was applied for the preprocessing of NMR spectra data, including phase correction, baseline correction, chemical shift referencing, and spectra alignment [24]. After constant sum normalization of the spectra, the data matrix was exported to the ASICS R package for the identification and quantification of metabolites. ASICS uses a library of pure metabolite spectra as a reference to fit an unpenalized model, followed by control of the family-wise error rate (FWER). The model fit then provides the relative quantifications of metabolites in each sample [25].
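Constant sum normalization rescales each spectrum so that its intensities add up to the same total, which removes differences in overall signal between samples. A minimal sketch follows, assuming a samples-by-bins intensity matrix; the function name and the default target total of 1.0 are illustrative choices, not settings reported for NMRProcFlow in this study.

```python
import numpy as np


def constant_sum_normalize(spectra, total=1.0):
    """Scale each row (one spectrum) so that its intensities sum to `total`."""
    spectra = np.asarray(spectra, dtype=float)
    row_sums = spectra.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("a spectrum with zero total intensity cannot be normalized")
    return spectra / row_sums * total


# Two spectra with the same shape but different overall intensity
# become identical after normalization.
example = constant_sum_normalize([[1.0, 2.0, 1.0], [2.0, 4.0, 2.0]])
```

After this step, only the relative distribution of intensity across the spectrum is retained for each sample.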
Continuous variables are presented as means and standard deviations (SDs) if they conformed to a normal distribution; otherwise, medians and interquartile ranges (IQRs) were used. The differences in continuous variables between the MetS and control groups were tested by Student's t-test or the Wilcoxon rank-sum test. Categorical variables are presented as frequencies and percentages. The Chi-square test or Fisher's exact test was used to examine the differences in categorical variables between the two groups. Multiple testing was corrected by controlling the false discovery rate (FDR).
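The text does not state which FDR procedure was applied. As an illustration only, the sketch below assumes the widely used Benjamini-Hochberg step-up procedure and returns adjusted P values; the function name is hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Raw adjustment: p_(i) * m / i for the i-th smallest P value.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest P value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A metabolite would then be called significant when its adjusted P value falls below the chosen FDR level.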
An orthogonal partial least squares discriminant analysis (OPLS-DA) model was built to identify the metabolic biomarkers using SIMCA, version 14.1 (Umetrics, Umeå, Sweden). To estimate the association between metabolic biomarkers and cardiometabolic risk factors, Spearman's rank correlation was performed and visualized using the "corrplot" R package. Metabolic pathway analysis and protein-metabolite-disease interaction network analysis were performed using MetaboAnalyst [26], and Cytoscape, version 3.7.1 (National Institute of General Medical Sciences, Bethesda, USA) was used to create the interaction networks. The diagnostic models for MetS were constructed using four machine learning algorithms: SVM ("e1071" R package), RF ("randomForest" R package), KNN ("kknn" R package), and logistic regression (the "glm" function in R). Receiver operating characteristic (ROC) curves were used to evaluate the predictive performance of the models. The area under the curve (AUC) and 95% bootstrap confidence intervals (CIs) were also estimated.
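The AUC and its bootstrap confidence interval can be sketched as follows. This is an illustrative reimplementation, not the authors' R code: it computes the AUC from the Mann-Whitney statistic and assumes a percentile bootstrap, since the bootstrap variant is not specified in the text. The function names, the number of resamples, and the seed are assumptions.

```python
import numpy as np


def roc_auc(labels, scores):
    """AUC as the probability that a case scores higher than a control."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both classes must be present")
    diff = pos[:, None] - neg[None, :]
    # Ties between a case and a control count as half a concordant pair.
    concordant = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(concordant / (pos.size * neg.size))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if labels.all() or not labels.any():
        raise ValueError("both classes must be present")
    rng = np.random.default_rng(seed)
    n = labels.size
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample with replacement
        if labels[idx].all() or not labels[idx].any():
            continue  # skip resamples that contain a single class
        aucs.append(roc_auc(labels[idx], scores[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)
```

Here `labels` would mark MetS cases as 1 and controls as 0, and `scores` would hold the predicted probabilities or decision values from any of the four models.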
Statistical analyses were performed using R, version 4.1.2 (R Foundation for Statistical Computing) and SPSS 25.0 (IBM Corporation, New York, USA). Two-tailed P < 0.05 was considered statistically significant.

Clinical characteristics of the study population
In total, 205 MetS patients and 806 healthy controls were analysed in the present study. The average ages of the MetS and control groups were 57.21 ± 10.00 and 47.05 ± 12.93 years, respectively. The levels of body mass index (BMI), SBP, DBP, hip circumference (HC), WC, waist-to-hip ratio (WHR), FPG, TG, total cholesterol (TC), low-density lipoprotein cholesterol (LDL-C), blood urea nitrogen (BUN), and creatinine (Cr) were significantly higher in the MetS group than those in the control group, whereas a significantly lower level of HDL-C was observed in the MetS group (all P < 0.05). Aside from these, significantly different frequencies of abdominal obesity, elevated blood pressure, elevated FPG, elevated TG, and reduced HDL-C phenotypes were observed between the two groups (all P < 0.05). The details about the demographic, biochemical, and anthropometric characteristics of the MetS patients and healthy controls are presented in Table 1.

Identification of metabolic biomarkers
The metabolome of 1011 plasma samples was analysed using ¹H-NMR, and the stacked NMR spectra are shown in Additional file 1. After the preprocessing of NMR spectra, identification and quantification of metabolites, and removal of missing values, 85 metabolites were identified successfully (Fig. 2A and Additional file 2). The variable importance on projection (VIP) value of each metabolite was calculated by the OPLS-DA model, and the metabolites with VIP values > 1 were considered the  (Table 2 and Additional file 2).

Metabolic pathway enrichment analysis
Metabolic pathway analysis was performed to reveal the disturbed metabolic pathways related to MetS based on potential metabolic biomarkers. These metabolites were involved in 12 metabolic pathways (Fig. 2C and Additional file 3). Among these 12 metabolic pathways, two had P values < 0.05 and impact values > 0.00: the arginine and proline metabolism pathway and the glutathione metabolism pathway. The arginine and proline metabolism pathway includes 38 metabolites in total, of which 3 (guanidinoacetate, hydroxyproline, and L-ornithine) were measured in this study. The glutathione metabolism pathway includes 28 metabolites in total, of which 2 (pyroglutamic acid and L-ornithine) were measured in the present study (Fig. 2C and Additional file 3).
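Pathway enrichment P values of this kind are commonly obtained from an over-representation test. The sketch below assumes a hypergeometric test, which is one of the options offered by MetaboAnalyst; the text does not state which test was selected. The background size used in the example call is a hypothetical placeholder, and treating all 13 biomarkers as the query set is also an assumption.

```python
from math import comb


def hypergeom_tail(population, pathway_size, query_size, hits):
    """P(X >= hits) when `query_size` metabolites are drawn without
    replacement from `population`, of which `pathway_size` belong to
    the pathway of interest."""
    upper = min(pathway_size, query_size)
    total = comb(population, query_size)
    favourable = sum(
        comb(pathway_size, i) * comb(population - pathway_size, query_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / total


# Arginine and proline metabolism: 38 metabolites in the pathway, 3 hits.
# HYPOTHETICAL_BACKGROUND is an assumed number of metabolites in the
# reference pathway library, not a value reported in the study.
HYPOTHETICAL_BACKGROUND = 1500
p_value = hypergeom_tail(HYPOTHETICAL_BACKGROUND, 38, 13, 3)
```

The resulting value depends strongly on the assumed background, so it should not be compared directly with the P values reported in Additional file 3.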

Association between metabolic biomarkers and cardiometabolic risk factors
To investigate the potential relationships between the 13 metabolic biomarkers and 14 cardiometabolic risk factors, Spearman's correlation coefficients were calculated (Additional file 4). The matrix of correlation coefficients is visualized in Fig. 3. All 13 metabolic biomarkers were significantly associated with TG; 10 metabolites were associated with WC, WHR, SBP, FPG, HDL-C, LDL-C, and Cr; 9 with HC, BMI, and DBP; 6 with BUN; and 5 with age and TC (Additional file 4). The significant correlation coefficients ranged from −0.335 to 0.534. D-Fucose showed the highest correlation with the cardiometabolic risk factors and was associated with 9 of the 13 metabolic risk factors. The correlation coefficient between D-fucose and TG was the highest (r = 0.534, P < 0.001). Five metabolites were correlated with age (P < 0.05), with relatively low correlation coefficients ranging from −0.238 to 0.188. D-Maltose and deoxyadenosine were associated with all 14 cardiometabolic risk factors included in this study (Fig. 2D and Additional file 4).
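Spearman's coefficient is the Pearson correlation computed on ranks, which makes it sensitive to any monotonic relationship rather than only linear ones. The following minimal sketch illustrates the calculation with average ranks for ties; the function names are illustrative, and the P values reported in the study are not reproduced here.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1 to n, assigning tied values their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(values.size)
    ranks[order] = np.arange(1, values.size + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation between two equally long sequences."""
    rx, ry = average_ranks(x), average_ranks(y)
    return float(np.corrcoef(rx, ry)[0, 1])
```

Applied to each metabolite and risk factor pair, this yields a 13 × 14 matrix of coefficients of the kind summarized above.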

Protein-metabolite-disease interaction network
A protein-metabolite-disease interaction network was constructed to provide a comprehensive understanding of potential functional relationships among potential metabolic biomarkers, proteins, and diseases. Based on prior knowledge of literature associations, biological pathways, similar structures, and similar functions, the interactions between metabolites and proteins were retrieved from the Search Tool for Interactions of Chemicals (STITCH) database [27]. There were 38 proteins associated with 10 metabolic biomarkers for MetS (Fig. 3). According to the association between metabolites and diseases in the Human Metabolome Database (HMDB) [28], the metabolite-disease interaction network was also constructed to explore the association between MetS-related metabolites and chronic diseases. Finally, 23 diseases were associated with 10 MetS-related metabolites (Fig. 3).

Diagnostic models for MetS using machine learning algorithms
After comprehensively profiling the metabolic biomarkers, four machine learning algorithms (SVM, RF, KNN, and logistic regression) were used to construct diagnostic models based on the 13 metabolic biomarkers. The parameters of the different models were tuned using ten-fold cross-validation on the whole dataset. Then, the parameters were applied to the whole dataset to provide final metrics of the suitability of the models for classifying individuals with MetS and healthy controls. In the final models, the SVM used a radial kernel, the RF comprised 500 trees, and the KNN model used 19 neighbours. Diagnostic models based on the 14 cardiometabolic risk factors were also built to compare their predictive ability with that of the models based on metabolic biomarkers. The diagnostic performance of these eight models is shown in Table 3 and Fig. 4, and the AUCs ranged from 0.755 to 0.993 (Fig. 4).
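Ten-fold cross-validation for parameter tuning can be illustrated with the number of neighbours in a KNN classifier. The sketch below is a simplified stand-in: it uses an unweighted majority-vote KNN with Euclidean distance rather than the weighted scheme of the "kknn" package, and it assumes classification accuracy as the tuning criterion, which is not stated in the text. All function names, the candidate values, and the seed are hypothetical.

```python
import numpy as np


def knn_predict(train_X, train_y, test_X, k):
    """Majority vote among the k nearest training samples (labels 0/1)."""
    dists = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_y[nearest].mean(axis=1)
    return (votes > 0.5).astype(int)


def cross_validated_accuracy(X, y, k, n_folds=10, seed=0):
    """Pooled accuracy over n_folds held-out folds for a given k."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    correct = 0
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y)), fold)
        pred = knn_predict(X[train], y[train], X[fold], k)
        correct += int((pred == y[fold]).sum())
    return correct / len(y)


def tune_k(X, y, candidates, n_folds=10, seed=0):
    """Return the candidate k with the best cross-validated accuracy."""
    scores = {
        k: cross_validated_accuracy(X, y, k, n_folds, seed) for k in candidates
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Odd candidate values such as 1, 3, ..., 19 avoid tied votes in a two-class problem; with an even k, this sketch assigns ties to the control class.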

Discussion
Identifying key metabolic biomarkers and pathways relevant to MetS and its progression towards cardiometabolic diseases is considered a viable strategy for the predictive diagnosis and targeted prevention of cardiometabolic diseases. In the present study, we comprehensively described the metabolomic biosignatures of MetS, which differed significantly between MetS patients and healthy participants. Based on the 13 potential metabolic biomarkers for MetS, the pathway analysis suggested that the arginine and proline metabolism and glutathione metabolism pathways were disturbed in MetS patients. Four machine learning algorithms (SVM, RF, KNN, and logistic regression) were used to build diagnostic models for MetS. ROC curve analysis showed that the AUCs of the four models based on metabolic biomarkers ranged from 0.755 to 0.993. To our knowledge, the present study is the first to comprehensively describe the metabolomic biosignatures of MetS in a large, well-established Chinese cohort using ¹H-NMR-based metabolome profiling. Our findings show that the plasma metabolome provides a valuable resource of biomarkers for the diagnosis and prevention of MetS and its consequent cardiometabolic diseases. These metabolomic biomarkers also provide better insight into the critical metabolic pathways associated with MetS and a deeper understanding of its progression towards cardiometabolic diseases. Thus, the MetS-related metabolites and their metabolic patterns could be used in diagnostic models for population risk stratification and for targeted intervention against the progression of MetS towards chronic diseases, including CVD and T2DM. We identified significant differences between MetS patients and healthy controls in cardiovascular risk factors, including BMI, SBP, DBP, HC, WC, WHR, FPG, TG, TC, HDL-C, LDL-C, BUN, and Cr (Table 1).
We additionally found that the 13 metabolic biomarkers for MetS were also significantly correlated with these cardiovascular risk factors (Fig. 2D). These metabolites may also be affected by these clinical risk factors. Considering that MetS is a constellation of closely related cardiometabolic risk factors, these candidate metabolic biomarkers for MetS could also be potential biomarkers for abdominal obesity, hypertension, hyperglycemia, and dyslipidemia. Plasma concentrations of these metabolites may be important indicators of the pathophysiological mechanism of MetS and provide insights into effective treatments for cardiometabolic risk factors. Pathway analysis revealed that the arginine and proline metabolism pathway is associated with MetS. Guanidinoacetate, hydroxyproline, and L-ornithine are the measured metabolites that participate in arginine and proline metabolism. Arginine, a semi-essential amino acid, is one of the most metabolically versatile amino acids. It serves as a precursor for the synthesis of urea, polyamines, proline, nitric oxide, creatine, glutamate, and agmatine [29]. Numerous studies have suggested that intravenous use or dietary supplementation of arginine is beneficial in improving cardiovascular, pulmonary, renal, gastrointestinal, liver, and immune functions, as well as enhancing insulin sensitivity and maintaining tissue integrity [30]. The dynamic balance of L-arginine may be an endogenous determinant of arterial tone in hypertension [31]. Mirmiran et al. [32] found that plant-derived L-arginine could be a potentially protective factor against the development of MetS and its phenotypes, and that higher intakes of animal-derived L-arginine could be a dietary risk factor for the development of MetS.
The potential modulatory effects of L-arginine supplementation are currently considered a novel and effective strategy for the treatment and prevention of MetS and its phenotypes, including central obesity, hyperglycemia, and dyslipidemia [33,34]. In our study, a significantly higher level of guanidinoacetate was found in MetS patients. In contrast, a significantly lower level of L-ornithine was found in MetS patients. These findings suggest that MetS and its phenotypes are associated with an imbalance of arginine metabolism, and that these biomarkers could be used as new intervention targets for MetS and cardiometabolic risk factors.
Hydroxyproline, a nonessential amino acid, is a structurally and physiologically important amino acid in humans. Emerging evidence indicates that the oxidation of hydroxyproline plays a significant role in regulating oxidative defense, apoptosis, and angiogenesis [35]. Studies have suggested that chronic low-grade inflammation and oxidative stress in obese individuals are important underlying mechanisms that lead to the development of MetS through altered cellular and nuclear mechanisms, including impairments in DNA damage repair and cell cycle regulation [12]. Capel et al. [36] observed that metabolites from the arginine and proline metabolism pathways were significantly different between MetS patients and healthy controls. Targeted and untargeted metabolite profiling found that hydroxyproline could be a potential metabolic biomarker for cardiovascular diseases [37]. In the present study, the significantly lower level of hydroxyproline in MetS patients indicated that plasma hydroxyproline was associated with MetS and its cardiovascular phenotypes. These findings indicate that plasma hydroxyproline could be used as a potential biomarker for the progression of MetS towards cardiovascular diseases, and that hydroxyproline metabolism could serve as a treatment target for MetS and cardiometabolic diseases.
Pyroglutamic acid and L-ornithine are the measured metabolites that participate in glutathione metabolism. Glutathione is a low-molecular-weight tripeptide composed of the amino acids glutamate, cysteine, and glycine [38]. It plays a pivotal role in maintaining redox balance, reducing oxidative stress, enhancing metabolic detoxification, and regulating the immune response [38]. A large body of evidence suggests that glutathione may be a potential biomarker and treatment target in various chronic metabolic diseases, such as hypertension, T2DM, and CVD [39][40][41]. Sekhar et al. [42] found that patients with uncontrolled T2DM have severely decreased synthesis of glutathione. In the present study, significantly lower levels of pyroglutamic acid and L-ornithine in the glutathione metabolism pathway were observed in MetS patients. These findings suggest that glutathione synthesis is deficient in MetS patients, which indicates that elevated oxidative stress may play a significant role in the pathophysiology of MetS.
The metabolite-protein interaction network enables the visualization and exploration of interactions between metabolites and functionally related proteins. This visual network can be used to gain new insights into the pathophysiology of MetS and its progression towards cardiometabolic diseases. According to the associations between metabolites and diseases obtained from the HMDB database, a metabolite-disease interaction network was also produced to explore the disease-related metabolites. In the present study, MetS-related metabolic biomarkers were found to be associated with 23 diseases, such as Parkinson's disease, Alzheimer's disease, lung cancer, and schizophrenia. Some of these diseases have been reported to be associated with MetS. Lower levels of L-ornithine, hydroxyproline, carnosine, and L-asparagine were observed in individuals with MetS. All four of these potential metabolic biomarkers for MetS were also found to be associated with Alzheimer's disease. Previous studies suggest that MetS and T2DM are risk factors for Alzheimer's disease [43]. The mechanism linking MetS to Alzheimer's disease may involve aberrations in amino acid metabolism in MetS patients.
Several limitations of the present study need to be addressed. Firstly, causal effects are difficult to infer from a cross-sectional study design. The observed MetS-related metabolites may be consequences rather than causes of MetS and its phenotypes. To investigate causal relationships between metabolic biomarkers and cardiometabolic risk factors, Mendelian randomization studies in the same cohort of participants are needed. Secondly, given the semi-quantitative nature of untargeted metabolomics profiling, a targeted metabolomics study is underway in the same cohort to validate the potential biomarkers and pathways identified in the present study. Despite these limitations, the present study shows that plasma metabolomics offers an innovative alternative for the recognition of MetS. Building on these findings, further studies in diverse populations and geographical areas are warranted.

Conclusions
The early diagnosis of MetS has the potential to identify patients who are at high risk of developing CVD and T2DM at early stages, and evidence-based intervention for MetS may be a cost-effective method for the targeted prevention and personalized treatment of cardiometabolic diseases, such as CVD and T2DM. A total of 13 metabolites, including trans-aconitic acid, methanol, guanidinoacetate, hydroxyproline, pyroglutamic acid, glutaconic acid, D-maltose, D-fucose, taurine, deoxyadenosine, L-ornithine, L-asparagine, and carnosine, were selected as candidate biomarkers for MetS. The present study revealed the potential value of metabolomic biomarkers for the predictive diagnosis of MetS. MetS patients show a widespread metabolic disturbance. The significantly higher level of guanidinoacetate and significantly lower level of L-ornithine in MetS patients indicated that the disturbance of arginine metabolism plays a significant role in the pathophysiologic mechanism of MetS and its phenotypes. Hydroxyproline and glutathione metabolism may also play roles in the pathophysiologic mechanism of MetS. These findings highlight the potential utility of MetS-related metabolic biomarkers and pathways for the targeted prevention and personalized therapy of cardiometabolic diseases.